Working with Data in R

September 9 + 11, 2024

Jo Hardin

Much of this material can be found at the introduction to dplyr vignette.

Agenda 9/9/24

  1. Tidy data
  2. Data verbs

The data

What does a data set look like?

  • Observations down the rows
  • Variables across the columns
  • Flat file versus relational database.

Active Duty Military

The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?

Tidy packages: the tidyverse

image credit: https://www.tidyverse.org/.

Reading in data from a file

Hosted online:

movies <- read_csv("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/movies2.csv")

Hosted locally:

movies <- read_csv("movies2.csv")

Things to note:

  • The assign arrow is used to create objects in R, which are stored in your environment.
  • Object names don’t have to correspond to file names.
  • Be sure R knows where to look for the file!
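As a minimal sketch of the assign arrow (the vector here is made up rather than read from a file), note that the object name is entirely up to you:

```r
# `<-` creates an object in your environment; the name need not match any file
scores <- c(88, 92, 75)

mean(scores)   # 85
```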

Viewing data - the viewer / Environment

  • View() can be used in RStudio to bring up an Excel-style spreadsheet. Only for viewing, not editing!
  • The dimensions of the data are found in the environment pane.
  • The names of the variables are seen at the top of the viewer.
  • View() has a capital letter V.
  • View() should not be used in the Quarto document.

Viewing data - inside .qmd / the console

  • dim() is used to find the dimensions (rows x columns).
  • names() is used to find the names of the variables.
  • head() is used to print the first several lines of the dataset to the console.

Practice

  1. What are the dimensions of the data set?
  2. What are the variables?
  3. What appears to be the unit of observation?
dim(movies)
[1] 134   5
names(movies)
[1] "...1"        "score2"      "rating2"     "genre2"      "box office2"
head(movies,3)
# A tibble: 3 × 5
  ...1             score2 rating2 genre2     `box office2`
  <chr>             <dbl> <chr>   <chr>              <dbl>
1 2 Fast 2 Furious   48.9 PG-13   action             127. 
2 28 Days Later      78.2 R       horror              45.1
3 A Guy Thing        39.5 PG-13   rom comedy          15.5

Reading in data from a package

For now, we’ll work with all flights out of the three NYC airports in 2013.

  1. Download and install the package from CRAN (done in the Console, only once).
install.packages("nycflights13")
  2. Load the package (in the .qmd file; needed for the .qmd file to compile appropriately).
library(nycflights13)
  3. Make the data set visible.
data(flights)
  4. Get help.
?flights

Slice and dice with dplyr (a package within the tidyverse)

dplyr

Whenever you’re learning a new tool, for a long time you’re going to suck … but the good news is that is typical, that’s something that happens to everyone, and it’s only temporary.

-Hadley Wickham

Why dplyr?

Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.

This is all possible in base R, but with dplyr, it is simple, readable, and fast.
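As a small illustration (with a made-up toy data frame rather than a real dataset), compare base R subsetting to the dplyr equivalent:

```r
library(dplyr)

toy <- data.frame(month = c(1, 1, 2), day = c(1, 2, 1), delay = c(5, -3, 10))

# base R: the data frame name is repeated inside the brackets
base_way <- toy[toy$month == 1 & toy$day == 1, ]

# dplyr: the data frame is named once, and the condition reads like prose
dplyr_way <- filter(toy, month == 1, day == 1)
```

Both return the same single row; the dplyr version simply states the conditions.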

Verbs

Most data wrangling happens with a set of data verbs. Verbs are functions that act on dataframes.

The first argument of each data verb is the dataframe.

Some Basic Verbs

  • filter()
  • arrange()
  • select()
  • distinct()
  • mutate()
  • summarize()
  • sample_n()

filter()

Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame; the following arguments are the filters that you’d like to apply.

For all flights on January 1st:

filter(flights, month == 1, day == 1)
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Constructing filters

Filters are constructed of logical operators: <, >, <=, >=, ==, != (and some others).

Adding them one by one to filter() is akin to saying “this AND that”. To say “this OR that OR both”, use |.

filter(flights, month == 1 | month == 2)
# A tibble: 51,955 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 51,945 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Practice

Construct filters to isolate:

  1. Flights that left on St. Patrick’s Day.
  2. Flights that were destined for Chicago’s primary airport.
  3. Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
  4. Flights with flight distances of more than 2000 miles or that were in the air more than 5 hours.

Solution

  1. Flights that left on St. Patrick’s Day.
  2. Flights that were destined for Chicago’s primary airport.
  3. Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
  4. Flights with flight distances of more than 2000 miles or that were in the air more than 5 hours.
filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5*60)

arrange()

arrange() reorders the rows: it takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns:

arrange(flights, year, month, day)

Use desc() to sort in descending order.

arrange(flights, desc(arr_delay))

select()

Often you work with large datasets with many columns where only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

select(flights, year, month, day)

You can exclude columns using - and specify a range using :.

select(flights, -(year:day))

distinct()

A common use of select() is to find out which values a set of variables takes. This is particularly useful in conjunction with the distinct() verb, which returns only the unique rows in a table.

What do the following data correspond to?

distinct(select(flights, origin, dest))
# A tibble: 224 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 JFK    MIA  
 4 JFK    BQN  
 5 LGA    ATL  
 6 EWR    ORD  
 7 EWR    FLL  
 8 LGA    IAD  
 9 JFK    MCO  
10 LGA    ORD  
# ℹ 214 more rows

mutate()

As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

select(mutate(flights, gain = dep_delay - arr_delay), 
       flight, dep_delay, arr_delay, gain)
# A tibble: 336,776 × 4
   flight dep_delay arr_delay  gain
    <int>     <dbl>     <dbl> <dbl>
 1   1545         2        11    -9
 2   1714         4        20   -16
 3   1141         2        33   -31
 4    725        -1       -18    17
 5    461        -6       -25    19
 6   1696        -4        12   -16
 7    507        -5        19   -24
 8   5708        -3       -14    11
 9     79        -3        -8     5
10    301        -2         8   -10
# ℹ 336,766 more rows

summarize() and sample_n()

summarize() collapses a data frame to a single row based on some function. It’s not very useful yet, but it will be.

sample_n() provides you with a random sample of rows.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6
sample_n(flights, 10)
# A tibble: 10 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     2     4     2306           2250        16     2357           2353
 2  2013    11    12     1221           1215         6     1409           1445
 3  2013     9    21      939            940        -1     1211           1235
 4  2013    11    19      727            730        -3      830            844
 5  2013     8    15      651            655        -4      942            930
 6  2013     7     8     1015           1025       -10     1204           1222
 7  2013     7    30     1159           1200        -1     1318           1310
 8  2013     7    11     1743           1735         8     2018           2030
 9  2013     5    20     2114           2117        -3     2335           2351
10  2013     2    24     1443           1445        -2     1652           1710
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Practice

Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.

Select the new variable and save it, along with tailnum, as a new data frame object.

Practice

Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.

Select the new variable and save it, along with tailnum, as a new data frame object.

flights2 <- mutate(flights, speed = distance/(air_time/60))
speed_data <- select(flights2, tailnum, speed)

group_by()

summarize() and sample_n() are even more powerful when combined with the idea of “group by”, repeating the operation separately on groups of observations within the dataset.

The group_by() function describes how to break a dataset down into groups of rows.

group_by()

Find the fastest airplanes in the bunch, measured as the average speed per airplane.

by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum, 
                       count = n(), 
                       avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))
# A tibble: 4,044 × 3
   tailnum count avg_speed
   <chr>   <int>     <dbl>
 1 N228UA      1      501.
 2 N315AS      1      499.
 3 N654UA      1      499.
 4 N819AW      1      490.
 5 N382HA     26      486.
 6 N388HA     36      484.
 7 N391HA     21      484.
 8 N777UA      1      483.
 9 N385HA     28      483.
10 N392HA     13      482.
# ℹ 4,034 more rows

Chaining

Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the |> operator. You start the pipeline with a data frame and then pass it to each function in turn.

The pipe syntax (|>) takes an object and sends it to the first argument of the function that follows. For example:

x |> f() is the same as f(x)

x |> f(y) is the same as f(x, y)
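These identities can be checked with any base R function (the native pipe |> requires R ≥ 4.1); for example:

```r
# x |> f()  is the same as  f(x)
c(3, 1, 2) |> sort()                    # 1 2 3

# x |> f(y)  is the same as  f(x, y)
c(3, 1, 2) |> sort(decreasing = TRUE)   # 3 2 1
```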

Mornings

me_step1 <- dress(me, what = sports)  
me_step2 <- exercise(me_step1, how = running)  
me_step3 <- eat(me_step2, choice = cereal)  
me_step4 <- dress(me_step3, what = school)  
me_step5 <- commute(me_step4, transportation = bike)  

Mornings

commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)

Mornings

(better??)

commute(
  dress(
    eat(
      exercise(
        dress(me, 
              what = sports), 
        how = running), 
      choice = cereal), 
    what = school), 
  transportation = bike)

Mornings

me |> 
  dress(what = sports) |> 
  exercise(how = running) |> 
  eat(choice = cereal) |> 
  dress(what = school) |> 
  commute(transportation = bike)

Little Bunny Foo Foo

From Hadley Wickham, how to think about piping.

Little bunny Foo Foo

Went hopping through the forest

Scooping up the field mice

And bopping them on the head

Little Bunny Foo Foo

The nursery rhyme could be created by a series of steps where the output from each step is saved as an object along the way.

foo_foo <- little_bunny()
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)

Little Bunny Foo Foo

Another approach is to nest the functions so that there is only one output.

bop(
   scoop(
      hop(foo_foo, through = forest),
      up = field_mice),
   on = head)

Little Bunny Foo Foo

Or even worse, as one line:

bop(scoop(hop(foo_foo, through = forest), up = field_mice), on = head)

Little Bunny Foo Foo

Instead, the code can be written using the pipe in the order in which the functions are evaluated:

foo_foo |> 
   hop(through = forest) |> 
   scoop(up = field_mice) |> 
   bop(on = head)

Flights

flights2 <- mutate(flights, speed = distance/(air_time/60))
tail_speed <- select(flights2, tailnum, speed)
tail_speed_grp <- group_by(tail_speed, tailnum)
tail_ave <- summarize(tail_speed_grp, number = n(),
                      avg_speed = mean(speed, na.rm = TRUE))
arrange(tail_ave, desc(avg_speed))
# A tibble: 4,044 × 3
   tailnum number avg_speed
   <chr>    <int>     <dbl>
 1 N228UA       1      501.
 2 N315AS       1      499.
 3 N654UA       1      499.
 4 N819AW       1      490.
 5 N382HA      26      486.
 6 N388HA      36      484.
 7 N391HA      21      484.
 8 N777UA       1      483.
 9 N385HA      28      483.
10 N392HA      13      482.
# ℹ 4,034 more rows
flights |> 
  mutate(speed = distance / (air_time/60)) |> 
  select(tailnum, speed) |> 
  group_by(tailnum) |>  
  summarize(number = n(), 
            avg_speed = mean(speed, na.rm = TRUE)) |> 
  arrange(desc(avg_speed))
# A tibble: 4,044 × 3
   tailnum number avg_speed
   <chr>    <int>     <dbl>
 1 N228UA       1      501.
 2 N315AS       1      499.
 3 N654UA       1      499.
 4 N819AW       1      490.
 5 N382HA      26      486.
 6 N388HA      36      484.
 7 N391HA      21      484.
 8 N777UA       1      483.
 9 N385HA      28      483.
10 N392HA      13      482.
# ℹ 4,034 more rows

Practice

Form a chain that creates a data frame containing only carrier and each carrier’s mean departure delay time. Which carriers have the highest and lowest mean delays?

Practice

Form a chain that creates a data frame containing only carrier and each carrier’s mean departure delay time. Which carriers have the highest and lowest mean delays?

flights |> 
  group_by(carrier) |> 
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(desc(avg_delay))
# A tibble: 16 × 2
   carrier avg_delay
   <chr>       <dbl>
 1 F9          20.2 
 2 EV          20.0 
 3 YV          19.0 
 4 FL          18.7 
 5 WN          17.7 
 6 9E          16.7 
 7 B6          13.0 
 8 VX          12.9 
 9 OO          12.6 
10 UA          12.1 
11 MQ          10.6 
12 DL           9.26
13 AA           8.59
14 AS           5.80
15 HA           4.90
16 US           3.78

Practice again

Say you’re curious about the relationship between the number of flights that each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.

Solution

delay_data <- flights |> 
  group_by(tailnum) |> 
  summarize(number = n(),
            dist = mean(distance, na.rm = TRUE), 
            delay = mean(arr_delay, na.rm = TRUE)) |> 
  filter(number > 20, dist < 2000)

Visualizing the data

delay_data |> 
  ggplot(aes(dist, delay)) +
  geom_point(aes(size = number), 
             alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

When scale_size_area() is used, the default behavior is to scale the area of the points to be proportional to the value.

Agenda 9/11/24

  1. pivoting
  2. Relational data (_join)
  3. lubridate

pivoting

image credit: https://www.garrickadenbuie.com/project/tidyexplain/

From wide to long and long to wide

  • pivot_longer() makes the dataframe “longer” – many columns into a few columns (more rows): pivot_longer(data, cols, names_to = , values_to = )
  • pivot_wider() makes the dataframe “wider” – a few columns into many columns (fewer rows): pivot_wider(data, names_from = , values_from = )
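A toy round trip (with made-up values) shows the two verbs undoing each other:

```r
library(tidyr)

wide <- tibble::tibble(country = c("A", "B"),
                       `2010` = c(1, 3),
                       `2011` = c(2, 4))

# wide -> long: the year columns collapse into a name/value pair of columns
long <- wide |> 
  pivot_longer(cols = -country, names_to = "year", values_to = "value")

# long -> wide: recover the original shape
wide_again <- long |> 
  pivot_wider(names_from = year, values_from = value)
```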

From wide to long and long to wide

pivot_longer

pivot_longer will be demonstrated using datasets from Gapminder.

litF represents country, year, and female literacy rate.

library(googlesheets4)
gs4_deauth()
litF <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")

litF
# A tibble: 260 × 38
   Adult (15+) literacy rate …¹ `1975` `1976` `1977` `1978` `1979` `1980` `1981`
   <chr>                         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan                      NA     NA     NA     NA   4.99   NA       NA
 2 Albania                          NA     NA     NA     NA  NA      NA       NA
 3 Algeria                          NA     NA     NA     NA  NA      NA       NA
 4 Andorra                          NA     NA     NA     NA  NA      NA       NA
 5 Angola                           NA     NA     NA     NA  NA      NA       NA
 6 Anguilla                         NA     NA     NA     NA  NA      NA       NA
 7 Antigua and Barbuda              NA     NA     NA     NA  NA      NA       NA
 8 Argentina                        NA     NA     NA     NA  NA      93.6     NA
 9 Armenia                          NA     NA     NA     NA  NA      NA       NA
10 Aruba                            NA     NA     NA     NA  NA      NA       NA
# ℹ 250 more rows
# ℹ abbreviated name: ¹​`Adult (15+) literacy rate (%). Female`
# ℹ 30 more variables: `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
#   `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>,
#   `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, …

pivot_longer

litF <- litF |> 
  select(country = starts_with("Adult"), everything()) |> 
  pivot_longer(cols = -country,
               names_to = "year", 
               values_to = "litRateF") |> 
  filter(!is.na(litRateF))

litF
# A tibble: 571 × 3
   country     year  litRateF
   <chr>       <chr>    <dbl>
 1 Afghanistan 1979      4.99
 2 Afghanistan 2011     13   
 3 Albania     2001     98.3 
 4 Albania     2008     94.7 
 5 Albania     2011     95.7 
 6 Algeria     1987     35.8 
 7 Algeria     2002     60.1 
 8 Algeria     2006     63.9 
 9 Angola      2001     54.2 
10 Angola      2011     58.6 
# ℹ 561 more rows

pivot_longer

GDP represents country, year, and gdp (in fixed 2000 US$).

GDP <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

GDP <- GDP |>  
  select(country = starts_with("Income"), everything()) |> 
  pivot_longer(cols = -country, 
               names_to = "year",
               values_to = "gdp") |> 
  filter(!is.na(gdp))

GDP
# A tibble: 7,988 × 3
   country year    gdp
   <chr>   <chr> <dbl>
 1 Albania 1980  1061.
 2 Albania 1981  1100.
 3 Albania 1982  1111.
 4 Albania 1983  1101.
 5 Albania 1984  1065.
 6 Albania 1985  1060.
 7 Albania 1986  1092.
 8 Albania 1987  1054.
 9 Albania 1988  1014.
10 Albania 1989  1092.
# ℹ 7,978 more rows

pivot_wider

pivot_wider will be demonstrated using the babynames dataset.

library(babynames)

babynames
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

pivot_wider babynames

babynames |>  
  select(-prop) |> 
  pivot_wider(names_from = sex, values_from = n) 
# A tibble: 1,756,284 × 4
    year name          F     M
   <dbl> <chr>     <int> <int>
 1  1880 Mary       7065    27
 2  1880 Anna       2604    12
 3  1880 Emma       2003    10
 4  1880 Elizabeth  1939     9
 5  1880 Minnie     1746     9
 6  1880 Margaret   1578    NA
 7  1880 Ida        1472     8
 8  1880 Alice      1414    NA
 9  1880 Bertha     1320    NA
10  1880 Sarah      1288    NA
# ℹ 1,756,274 more rows

pivot_wider babynames

babynames |>  
  select(-prop) |>  
  pivot_wider(names_from = sex, values_from = n) |> 
  filter(!is.na(F), !is.na(M)) |> 
  arrange(desc(year), desc(F))
# A tibble: 168,381 × 4
    year name          F     M
   <dbl> <chr>     <int> <int>
 1  2017 Emma      19738    14
 2  2017 Olivia    18632    10
 3  2017 Ava       15902    12
 4  2017 Isabella  15100    12
 5  2017 Sophia    14831    17
 6  2017 Mia       13437    16
 7  2017 Charlotte 12893     7
 8  2017 Amelia    11800    11
 9  2017 Evelyn    10675    11
10  2017 Abigail   10551     6
# ℹ 168,371 more rows

pivot_wider babynames

babynames |>  
  pivot_wider(names_from = sex, values_from = n) |> 
  mutate(maxcount = pmax(F, M, na.rm = TRUE)) |> 
  arrange(desc(maxcount))
# A tibble: 1,924,653 × 6
    year name      prop     F     M maxcount
   <dbl> <chr>    <dbl> <int> <int>    <int>
 1  1947 Linda   0.0548 99686    NA    99686
 2  1948 Linda   0.0552 96209    NA    96209
 3  1947 James   0.0510    NA 94756    94756
 4  1957 Michael 0.0424    NA 92695    92695
 5  1947 Robert  0.0493    NA 91642    91642
 6  1949 Linda   0.0518 91016    NA    91016
 7  1956 Michael 0.0423    NA 90620    90620
 8  1958 Michael 0.0420    NA 90520    90520
 9  1948 James   0.0497    NA 88588    88588
10  1954 Michael 0.0428    NA 88514    88514
# ℹ 1,924,643 more rows

Relational data (multiple data frames)

See the Posit cheatsheets on wrangling & joining and pivoting.

Joining two (or more) dataframes:

  • left_join returns all rows from the left table, and any rows with matching keys from the right table.
  • right_join returns all rows from the right table, and any rows with matching keys from the left table.
  • inner_join returns only the rows with matching keys in both tables.
  • full_join returns all rows from both tables, matching rows on their keys where possible.
  • semi_join returns the rows of the left table that have a match in the right table (keeping only the left table’s columns).
  • anti_join returns the rows of the left table that have no match in the right table.

Good practice: always specify the by argument when joining data frames.
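A small sketch of why (with toy tables made up here): without by, dplyr guesses the key from the shared column names and prints a message; with by, the key is explicit.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

# Without `by`, dplyr guesses the key from shared column names (and says so
# in a message) -- it works here, but silently depends on the column names.
joined_guess <- left_join(x, y)

# With `by`, the key is stated explicitly and intentionally.
joined <- left_join(x, y, by = "id")
```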

Women in Science

10 women in science who changed the world (source: Discover Magazine)

name                profession
Ada Lovelace        Mathematician
Marie Curie         Physicist and Chemist
Janaki Ammal        Botanist
Chien-Shiung Wu     Physicist
Katherine Johnson   Mathematician
Rosalind Franklin   Chemist
Vera Rubin          Astronomer
Gladys West         Mathematician
Flossie Wong-Staal  Virologist and Molecular Biologist
Jennifer Doudna     Biochemist

Inputs

professions
# A tibble: 10 × 2
   name               profession                        
   <chr>              <chr>                             
 1 Ada Lovelace       Mathematician                     
 2 Marie Curie        Physicist and Chemist             
 3 Janaki Ammal       Botanist                          
 4 Chien-Shiung Wu    Physicist                         
 5 Katherine Johnson  Mathematician                     
 6 Rosalind Franklin  Chemist                           
 7 Vera Rubin         Astronomer                        
 8 Gladys West        Mathematician                     
 9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna    Biochemist                        
dates
# A tibble: 8 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Rosalind Franklin        1920       1958
5 Vera Rubin               1928       2016
6 Gladys West              1930         NA
7 Flossie Wong-Staal       1947       2020
8 Jennifer Doudna          1964         NA
works
# A tibble: 9 × 2
  name               known_for                                                  
  <chr>              <chr>                                                      
1 Ada Lovelace       first computer algorithm                                   
2 Marie Curie        theory of radioactivity,  discovery of elements polonium a…
3 Janaki Ammal       hybrid species, biodiversity protection                    
4 Chien-Shiung Wu    confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson  calculations of orbital mechanics critical to sending the …
6 Vera Rubin         existence of dark matter                                   
7 Gladys West        mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna    one of the primary developers of CRISPR, a ground-breaking…

Desired output

We’d like to put together the data to look like:

# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confim a…
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947       2020 first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Inputs, reminder

nrow(professions)
[1] 10
nrow(dates)
[1] 8
nrow(works)
[1] 9
names(professions)
[1] "name"       "profession"
names(dates)
[1] "name"       "birth_year" "death_year"
names(works)
[1] "name"      "known_for"

Setup

For the next few slides…

x
# A tibble: 3 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     
3     3 x3     
y
# A tibble: 3 × 2
     id value_y
  <dbl> <chr>  
1     1 y1     
2     2 y2     
3     4 y4     

left_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
left_join(x, y, by = "id")
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   

left_join()

professions |> 
  left_join(dates, by = "name") 
# A tibble: 10 × 4
   name               profession                         birth_year death_year
   <chr>              <chr>                                   <dbl>      <dbl>
 1 Ada Lovelace       Mathematician                              NA         NA
 2 Marie Curie        Physicist and Chemist                      NA         NA
 3 Janaki Ammal       Botanist                                 1897       1984
 4 Chien-Shiung Wu    Physicist                                1912       1997
 5 Katherine Johnson  Mathematician                            1918       2020
 6 Rosalind Franklin  Chemist                                  1920       1958
 7 Vera Rubin         Astronomer                               1928       2016
 8 Gladys West        Mathematician                            1930         NA
 9 Flossie Wong-Staal Virologist and Molecular Biologist       1947       2020
10 Jennifer Doudna    Biochemist                               1964         NA

right_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
right_join(x, y, by = "id")
# A tibble: 3 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     4 <NA>    y4     

right_join()

professions |> 
  right_join(dates, by = "name") 
# A tibble: 8 × 4
  name               profession                         birth_year death_year
  <chr>              <chr>                                   <dbl>      <dbl>
1 Janaki Ammal       Botanist                                 1897       1984
2 Chien-Shiung Wu    Physicist                                1912       1997
3 Katherine Johnson  Mathematician                            1918       2020
4 Rosalind Franklin  Chemist                                  1920       1958
5 Vera Rubin         Astronomer                               1928       2016
6 Gladys West        Mathematician                            1930         NA
7 Flossie Wong-Staal Virologist and Molecular Biologist       1947       2020
8 Jennifer Doudna    Biochemist                               1964         NA

full_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
full_join(x, y, by = "id")
# A tibble: 4 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     
3     3 x3      <NA>   
4     4 <NA>    y4     

full_join()

dates |> 
  full_join(works, by = "name") 
# A tibble: 10 × 4
   name               birth_year death_year known_for                           
   <chr>                   <dbl>      <dbl> <chr>                               
 1 Janaki Ammal             1897       1984 hybrid species, biodiversity protec…
 2 Chien-Shiung Wu          1912       1997 confim and refine theory of radioac…
 3 Katherine Johnson        1918       2020 calculations of orbital mechanics c…
 4 Rosalind Franklin        1920       1958 <NA>                                
 5 Vera Rubin               1928       2016 existence of dark matter            
 6 Gladys West              1930         NA mathematical modeling of the shape …
 7 Flossie Wong-Staal       1947       2020 first scientist to clone HIV and cr…
 8 Jennifer Doudna          1964         NA one of the primary developers of CR…
 9 Ada Lovelace               NA         NA first computer algorithm            
10 Marie Curie                NA         NA theory of radioactivity,  discovery…

inner_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
inner_join(x, y, by = "id")
# A tibble: 2 × 3
     id value_x value_y
  <dbl> <chr>   <chr>  
1     1 x1      y1     
2     2 x2      y2     

inner_join()

dates |> 
  inner_join(works, by = "name") 
# A tibble: 7 × 4
  name               birth_year death_year known_for                            
  <chr>                   <dbl>      <dbl> <chr>                                
1 Janaki Ammal             1897       1984 hybrid species, biodiversity protect…
2 Chien-Shiung Wu          1912       1997 confirm and refine theory of radioac…
3 Katherine Johnson        1918       2020 calculations of orbital mechanics cr…
4 Vera Rubin               1928       2016 existence of dark matter             
5 Gladys West              1930         NA mathematical modeling of the shape o…
6 Flossie Wong-Staal       1947       2020 first scientist to clone HIV and cre…
7 Jennifer Doudna          1964         NA one of the primary developers of CRI…

semi_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
semi_join(x, y, by = "id")
# A tibble: 2 × 2
     id value_x
  <dbl> <chr>  
1     1 x1     
2     2 x2     

semi_join()

dates |> 
  semi_join(works, by = "name") 
# A tibble: 7 × 3
  name               birth_year death_year
  <chr>                   <dbl>      <dbl>
1 Janaki Ammal             1897       1984
2 Chien-Shiung Wu          1912       1997
3 Katherine Johnson        1918       2020
4 Vera Rubin               1928       2016
5 Gladys West              1930         NA
6 Flossie Wong-Staal       1947       2020
7 Jennifer Doudna          1964         NA

anti_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
anti_join(x, y, by = "id")
# A tibble: 1 × 2
     id value_x
  <dbl> <chr>  
1     3 x3     

anti_join()

dates |> 
  anti_join(works, by = "name") 
# A tibble: 1 × 3
  name              birth_year death_year
  <chr>                  <dbl>      <dbl>
1 Rosalind Franklin       1920       1958

Putting it all together

professions |> 
  left_join(dates, by = "name") |> 
  left_join(works, by = "name")
# A tibble: 10 × 5
   name               profession                 birth_year death_year known_for
   <chr>              <chr>                           <dbl>      <dbl> <chr>    
 1 Ada Lovelace       Mathematician                      NA         NA first co…
 2 Marie Curie        Physicist and Chemist              NA         NA theory o…
 3 Janaki Ammal       Botanist                         1897       1984 hybrid s…
 4 Chien-Shiung Wu    Physicist                        1912       1997 confirm …
 5 Katherine Johnson  Mathematician                    1918       2020 calculat…
 6 Rosalind Franklin  Chemist                          1920       1958 <NA>     
 7 Vera Rubin         Astronomer                       1928       2016 existenc…
 8 Gladys West        Mathematician                    1930         NA mathemat…
 9 Flossie Wong-Staal Virologist and Molecular …       1947       2020 first sc…
10 Jennifer Doudna    Biochemist                       1964         NA one of t…

Practice: litF and GDP from Gapminder.

left

litGDPleft <- left_join(litF, GDP, by=c("country", "year"))
dim(litGDPleft)
[1] 571   4
litGDPleft
# A tibble: 571 × 4
   country     year  litRateF   gdp
   <chr>       <chr>    <dbl> <dbl>
 1 Afghanistan 1979      4.99   NA 
 2 Afghanistan 2011     13      NA 
 3 Albania     2001     98.3  1282.
 4 Albania     2008     94.7  1804.
 5 Albania     2011     95.7  1966.
 6 Algeria     1987     35.8  1902.
 7 Algeria     2002     60.1  1872.
 8 Algeria     2006     63.9  2125.
 9 Angola      2001     54.2   298.
10 Angola      2011     58.6   630.
# ℹ 561 more rows

inner

litGDPinner <- inner_join(litF, GDP, by=c("country", "year"))
dim(litGDPinner)
[1] 505   4
litGDPinner
# A tibble: 505 × 4
   country             year  litRateF   gdp
   <chr>               <chr>    <dbl> <dbl>
 1 Albania             2001      98.3 1282.
 2 Albania             2008      94.7 1804.
 3 Albania             2011      95.7 1966.
 4 Algeria             1987      35.8 1902.
 5 Algeria             2002      60.1 1872.
 6 Algeria             2006      63.9 2125.
 7 Angola              2001      54.2  298.
 8 Angola              2011      58.6  630.
 9 Antigua and Barbuda 2001      99.4 9640.
10 Antigua and Barbuda 2011      99.4 9978.
# ℹ 495 more rows

full

litGDPfull <- full_join(litF, GDP, by=c("country", "year"))
dim(litGDPfull)
[1] 8054    4
litGDPfull
# A tibble: 8,054 × 4
   country     year  litRateF   gdp
   <chr>       <chr>    <dbl> <dbl>
 1 Afghanistan 1979      4.99   NA 
 2 Afghanistan 2011     13      NA 
 3 Albania     2001     98.3  1282.
 4 Albania     2008     94.7  1804.
 5 Albania     2011     95.7  1966.
 6 Algeria     1987     35.8  1902.
 7 Algeria     2002     60.1  1872.
 8 Algeria     2006     63.9  2125.
 9 Angola      2001     54.2   298.
10 Angola      2011     58.6   630.
# ℹ 8,044 more rows

join to merge two datasets

If you ever need to figure out which join is the right join for you, find an image that lays out what each function is doing. I found this one, taken from the Statistics Globe blog, and it is quite good.
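The toy tibbles x and y used in the join examples above are not defined on these slides; here is a minimal sketch that reproduces them, with values assumed from the tidyexplain figures:

```r
library(tibble)
library(dplyr)

# assumed definitions matching the tidyexplain figures
x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

full_join(x, y, by = "id")   # 4 rows: ids 1-4, with NAs where unmatched
inner_join(x, y, by = "id")  # 2 rows: ids 1 and 2
anti_join(x, y, by = "id")   # 1 row: id 3, which has no match in y
```

With these two tibbles defined, every join example on the previous slides can be run as-is.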

lubridate

lubridate is another R package for data wrangling!

In particular, lubridate makes it very easy to work with days, times, and dates. The basic idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want.
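For instance, the same date written three different ways parses to the same Date object (a quick sketch; the date used here is just an illustration):

```r
library(lubridate)

# three formats, one Date
ymd("2024-09-11")          # year-month-day
mdy("September 11, 2024")  # month-day-year, with a month name
dmy("11/09/2024")          # day-month-year
```

All three calls return the Date "2024-09-11".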

Example from the lubridate vignette.

If anyone drove a time machine, they would crash

The length of months and years change so often that doing arithmetic with them can be unintuitive. Consider a simple operation: January 31st + one month.

Should the answer be:

  1. February 31st (which doesn’t exist)
  2. March 3rd (31 days after January 31st), or
  3. February 28th (assuming it's not a leap year)

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
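A small sketch of the rollback behavior (dates chosen for illustration; 2021 is not a leap year):

```r
library(lubridate)

ymd("2021-01-31") + months(1)     # NA: February 31st doesn't exist
ymd("2021-01-31") %m+% months(1)  # "2021-02-28": rolled back to month end
ymd("2021-03-31") %m-% months(1)  # "2021-02-28": same rollback, subtracting
```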

basics in lubridate

library(lubridate)
rightnow <- now()

day(rightnow)
[1] 11
week(rightnow)
[1] 37
month(rightnow, label=FALSE)
[1] 9
month(rightnow, label=TRUE)
[1] Sep
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
year(rightnow)
[1] 2024

basics in lubridate

minute(rightnow)
[1] 14
hour(rightnow)
[1] 10
yday(rightnow)
[1] 255
mday(rightnow)
[1] 11
wday(rightnow, label=FALSE)
[1] 4
wday(rightnow, label=TRUE)
[1] Wed
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

But how do I create a date object?

jan31 <- ymd("2021-01-31")
jan31 + months(0:11)
 [1] "2021-01-31" NA           "2021-03-31" NA           "2021-05-31"
 [6] NA           "2021-07-31" "2021-08-31" NA           "2021-10-31"
[11] NA           "2021-12-31"
floor_date(jan31, "month") + months(0:11) + days(31)
 [1] "2021-02-01" "2021-03-04" "2021-04-01" "2021-05-02" "2021-06-01"
 [6] "2021-07-02" "2021-08-01" "2021-09-01" "2021-10-02" "2021-11-01"
[11] "2021-12-02" "2022-01-01"
jan31 + months(0:11) + days(31)
 [1] "2021-03-03" NA           "2021-05-01" NA           "2021-07-01"
 [6] NA           "2021-08-31" "2021-10-01" NA           "2021-12-01"
[11] NA           "2022-01-31"
jan31 %m+% months(0:11)
 [1] "2021-01-31" "2021-02-28" "2021-03-31" "2021-04-30" "2021-05-31"
 [6] "2021-06-30" "2021-07-31" "2021-08-31" "2021-09-30" "2021-10-31"
[11] "2021-11-30" "2021-12-31"

NYC flights

library(nycflights13)
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     

NYC flights

flightsWK <- flights |>  
   mutate(ymdday = ymd(paste(year, month, day, sep="-"))) |> 
   mutate(weekdy = wday(ymdday, label=TRUE), 
          whichweek = week(ymdday)) 

flightsWK |>  select(year, month, day, ymdday, weekdy, whichweek, 
                     dep_time, arr_time, air_time) 
# A tibble: 336,776 × 9
    year month   day ymdday     weekdy whichweek dep_time arr_time air_time
   <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
 1  2013     1     1 2013-01-01 Tue            1      517      830      227
 2  2013     1     1 2013-01-01 Tue            1      533      850      227
 3  2013     1     1 2013-01-01 Tue            1      542      923      160
 4  2013     1     1 2013-01-01 Tue            1      544     1004      183
 5  2013     1     1 2013-01-01 Tue            1      554      812      116
 6  2013     1     1 2013-01-01 Tue            1      554      740      150
 7  2013     1     1 2013-01-01 Tue            1      555      913      158
 8  2013     1     1 2013-01-01 Tue            1      557      709       53
 9  2013     1     1 2013-01-01 Tue            1      557      838      140
10  2013     1     1 2013-01-01 Tue            1      558      753      138
# ℹ 336,766 more rows
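As an aside, lubridate's make_date() builds a Date directly from numeric year, month, and day components, so the paste()/ymd() step above could equivalently be written this way (a sketch, not the slides' code):

```r
library(nycflights13)
library(dplyr)
library(lubridate)

flights |>
  mutate(ymdday = make_date(year, month, day)) |>  # skip paste()/ymd()
  select(year, month, day, ymdday) |>
  head(3)
```

This avoids building and re-parsing an intermediate string.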

reprex

Help me help you

reproducible example …

Step 1. Copy code onto the clipboard

Step 2. Type reprex() into the Console

Step 3. Look at the Viewer to the right. Copy the Viewer output into GitHub, Piazza, Discord, an email, Stack Exchange, etc.

reprex demo

reprex(
  jan31 + months(0:11) + days(31)
)

multiple lines of code:

reprex({
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})
reprex({
  library(lubridate)
  jan31 <- ymd("2021-01-31")
  jan31 + months(0:11) + days(31)
})